Mutual Information Score (mutual_info_score)#

mutual_info_score measures the mutual information (MI) between two discrete labelings (often: two clusterings).

Think of it as: “How much does knowing clustering A reduce uncertainty about clustering B?”

Goals

  • Define MI using entropy and the joint distribution

  • Compute MI from a contingency table (counts)

  • Implement mutual_info_score from scratch in NumPy (and verify vs scikit-learn)

  • Visualize how MI behaves under label permutation, random independence, and label noise

  • Use MI as a non-differentiable objective to tune a simple logistic-regression decision threshold

  • Understand pros/cons and when to prefer NMI/AMI

Quick import (scikit-learn)

from sklearn.metrics import mutual_info_score

Units: scikit-learn uses the natural logarithm, so MI is reported in nats; divide by \(\ln 2\) to convert to bits.
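A quick check of the units, on a tiny hand-made labeling (the arrays here are illustrative, not from the dataset below):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Two labelings that are the same 3-way partition under different names.
a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 2]

mi_nats = mutual_info_score(a, b)   # natural log -> nats
mi_bits = mi_nats / np.log(2)       # divide by ln(2) to get bits

print(f"{mi_nats:.4f} nats = {mi_bits:.4f} bits")  # 1.0986 nats = 1.5850 bits
```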

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.metrics import (
    adjusted_mutual_info_score,
    mutual_info_score,
    normalized_mutual_info_score,
)
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

1) What does mutual_info_score measure?#

We observe two label arrays of length \(n\):

  • labels_true: a partition \(U\) (e.g., ground-truth classes or clustering A)

  • labels_pred: a partition \(V\) (e.g., predicted clusters or clustering B)

Mutual information \(I(U;V)\) is:

  • 0 when \(U\) and \(V\) are independent (knowing one tells you nothing about the other)

  • large when \(U\) and \(V\) are strongly dependent (knowing one reveals a lot about the other)

In clustering, MI is popular because it is permutation-invariant: renaming clusters (0 ↔ 2, etc.) does not change the score.
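The invariance is easy to verify directly: renaming one labeling's clusters through any permutation leaves the score unchanged (a minimal check on made-up random labels):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
labels_a = rng.integers(0, 3, size=200)
labels_b = rng.integers(0, 3, size=200)

# Rename B's clusters: 0 -> 2, 1 -> 0, 2 -> 1.
perm = np.array([2, 0, 1])
labels_b_renamed = perm[labels_b]

mi_original = mutual_info_score(labels_a, labels_b)
mi_renamed = mutual_info_score(labels_a, labels_b_renamed)
print(mi_original, mi_renamed)  # equal: only co-occurrence counts matter
```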

2) Mutual information for discrete variables#

Let \(U\) and \(V\) be discrete random variables with joint pmf \(p(u,v)\) and marginals \(p(u)\), \(p(v)\).

Entropy (uncertainty)#

\[ H(U) = -\sum_u p(u)\,\log p(u) \]

Mutual information (shared information)#

Equivalent definitions:

\[ I(U;V) = \sum_{u}\sum_{v} p(u,v)\,\log\frac{p(u,v)}{p(u)p(v)} = H(U) + H(V) - H(U,V) = H(U) - H(U\mid V) \]

Interpretation:

  • \(p(u)p(v)\) is what the joint would look like if \(U\) and \(V\) were independent

  • the log-ratio compares observed co-occurrence vs independence

  • cells where \(p(u,v) > p(u)p(v)\) contribute positively; cells where \(p(u,v) < p(u)p(v)\) contribute negatively, but the total \(I(U;V)\ge 0\)
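The equivalence of these definitions can be checked numerically on a small joint pmf (the 2×2 table below is made up for illustration):

```python
import numpy as np

# A small joint pmf over (U, V); rows index u, columns index v.
p_uv = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
p_u = p_uv.sum(axis=1)  # marginal of U
p_v = p_uv.sum(axis=0)  # marginal of V

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Direct definition: sum over cells of p(u,v) * log( p(u,v) / (p(u) p(v)) )
mi_direct = np.sum(p_uv * np.log(p_uv / np.outer(p_u, p_v)))

# Entropy identity: H(U) + H(V) - H(U, V)
mi_entropy = entropy(p_u) + entropy(p_v) - entropy(p_uv.ravel())

print(mi_direct, mi_entropy)  # the two agree
```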

3) From label arrays to MI: the contingency matrix#

Given \(n\) paired observations \((u_i, v_i)\), define the contingency table counts:

  • \(n_{ij}\) = number of samples with \(u=i\) and \(v=j\)

  • \(n_{i*} = \sum_j n_{ij}\) (row sums)

  • \(n_{*j} = \sum_i n_{ij}\) (column sums)

  • \(n = \sum_{i,j} n_{ij}\) (total)

A common “plug-in” estimator replaces probabilities by empirical frequencies:

\[ \widehat{I}(U;V) = \sum_{i}\sum_{j} \frac{n_{ij}}{n}\,\log\left( \frac{n\,n_{ij}}{n_{i*}n_{*j}} \right) \]

By convention, terms with \(n_{ij}=0\) contribute \(0\) (since \(0\log 0 := 0\)).

This is what sklearn.metrics.mutual_info_score computes (using the natural log).

def contingency_matrix_numpy(labels_true, labels_pred):
    """Build the contingency matrix N where N[i, j] counts samples with
    true label i and predicted label j (after re-indexing labels to 0..K-1).
    """
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    if labels_true.shape[0] != labels_pred.shape[0]:
        raise ValueError("labels_true and labels_pred must have the same length")

    true_values, true_inv = np.unique(labels_true, return_inverse=True)
    pred_values, pred_inv = np.unique(labels_pred, return_inverse=True)

    n_true = true_values.shape[0]
    n_pred = pred_values.shape[0]

    flat = true_inv * n_pred + pred_inv
    counts = np.bincount(flat, minlength=n_true * n_pred)
    contingency = counts.reshape(n_true, n_pred)

    return contingency, true_values, pred_values
def mutual_info_score_numpy(labels_true, labels_pred):
    """NumPy implementation of mutual information between two labelings.

    Matches sklearn.metrics.mutual_info_score (natural log, result in nats).
    """
    contingency, _, _ = contingency_matrix_numpy(labels_true, labels_pred)
    contingency = contingency.astype(float)

    n = contingency.sum()
    if n == 0:
        return 0.0

    row_sum = contingency.sum(axis=1)
    col_sum = contingency.sum(axis=0)

    i_idx, j_idx = np.nonzero(contingency)
    nij = contingency[i_idx, j_idx]

    mi = (nij / n) * np.log((nij * n) / (row_sum[i_idx] * col_sum[j_idx]))
    return float(mi.sum())
# Quick verification vs scikit-learn
for k_true, k_pred in [(3, 3), (4, 2), (10, 10)]:
    labels_true_rand = rng.integers(0, k_true, size=500)
    labels_pred_rand = rng.integers(0, k_pred, size=500)

    mi_sklearn = mutual_info_score(labels_true_rand, labels_pred_rand)
    mi_numpy = mutual_info_score_numpy(labels_true_rand, labels_pred_rand)

    print(
        f"k_true={k_true:>2}, k_pred={k_pred:>2} | sklearn={mi_sklearn:.6f} numpy={mi_numpy:.6f} | diff={abs(mi_sklearn - mi_numpy):.2e}"
    )
k_true= 3, k_pred= 3 | sklearn=0.003990 numpy=0.003990 | diff=2.64e-16
k_true= 4, k_pred= 2 | sklearn=0.004427 numpy=0.004427 | diff=5.85e-16
k_true=10, k_pred=10 | sklearn=0.079174 numpy=0.079174 | diff=1.94e-16

4) Intuition: permutation, independence, and noise#

We’ll build a small toy labeling with 3 clusters and compare three predicted labelings:

  1. Perfect but permuted: same partition, different label ids (MI should be maximal: \(H(U) = \log 3 \approx 1.0986\) nats for 3 balanced clusters)

  2. Independent random labels: no relationship (MI near 0)

  3. Noisy labels: flip some fraction of labels at random (MI decreases with noise)

n_per_cluster = 150
labels_true = np.repeat(np.arange(3), n_per_cluster)
rng.shuffle(labels_true)

# 1) perfect but permuted
perm = np.array([2, 0, 1])
labels_perm = perm[labels_true]

# 2) independent random labels (same number of clusters)
labels_random = rng.integers(0, 3, size=labels_true.shape[0])


# 3) noisy labels: with prob p, replace by a random label
def add_label_noise(labels, n_labels, p_noise, rng):
    labels = np.asarray(labels).copy()
    mask = rng.random(labels.shape[0]) < p_noise
    labels[mask] = rng.integers(0, n_labels, size=mask.sum())
    return labels


labels_noisy = add_label_noise(labels_true, n_labels=3, p_noise=0.3, rng=rng)

for name, lab in [
    ("perfect (permuted)", labels_perm),
    ("random (independent)", labels_random),
    ("noisy (p=0.3)", labels_noisy),
]:
    print(f"{name:>18}: MI={mutual_info_score_numpy(labels_true, lab):.4f} nats")
perfect (permuted): MI=1.0986 nats
random (independent): MI=0.0008 nats
     noisy (p=0.3): MI=0.4862 nats
def contingency_heatmap_trace(contingency, true_vals, pred_vals, showscale=False):
    contingency = np.asarray(contingency)
    return go.Heatmap(
        z=contingency,
        x=[str(v) for v in pred_vals],
        y=[str(v) for v in true_vals],
        colorscale="Blues",
        showscale=showscale,
        text=contingency,
        texttemplate="%{text}",
        hovertemplate="true=%{y}<br>pred=%{x}<br>count=%{z}<extra></extra>",
    )
cont_perm, true_vals, pred_perm_vals = contingency_matrix_numpy(labels_true, labels_perm)
cont_rand, _, pred_rand_vals = contingency_matrix_numpy(labels_true, labels_random)
cont_noisy, _, pred_noisy_vals = contingency_matrix_numpy(labels_true, labels_noisy)

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=[
        "Perfect (permuted)",
        "Random (independent)",
        "Noisy (p=0.3)",
    ],
)

fig.add_trace(contingency_heatmap_trace(cont_perm, true_vals, pred_perm_vals), row=1, col=1)
fig.add_trace(contingency_heatmap_trace(cont_rand, true_vals, pred_rand_vals), row=1, col=2)
fig.add_trace(contingency_heatmap_trace(cont_noisy, true_vals, pred_noisy_vals), row=1, col=3)

fig.update_xaxes(title_text="pred label", row=1, col=1)
fig.update_xaxes(title_text="pred label", row=1, col=2)
fig.update_xaxes(title_text="pred label", row=1, col=3)
fig.update_yaxes(title_text="true label", row=1, col=1)

fig.update_layout(title="Contingency tables (counts)", height=350)
fig.show()

A quick read:

  • A near-diagonal contingency table means each true cluster mostly maps to one predicted cluster ⇒ high MI.

  • A near-uniform table means the predicted labels don’t depend on the true labels ⇒ MI near 0.

ps = np.linspace(0.0, 1.0, 41)
n_repeats = 30

mi_means = []
mi_stds = []

for p_noise in ps:
    vals = []
    for _ in range(n_repeats):
        lab = add_label_noise(labels_true, n_labels=3, p_noise=float(p_noise), rng=rng)
        vals.append(mutual_info_score_numpy(labels_true, lab))
    vals = np.array(vals)
    mi_means.append(vals.mean())
    mi_stds.append(vals.std())

mi_means = np.array(mi_means)
mi_stds = np.array(mi_stds)

fig = go.Figure()

fig.add_trace(go.Scatter(x=ps, y=mi_means, mode="lines", name="mean MI"))

fig.add_trace(
    go.Scatter(
        x=np.r_[ps, ps[::-1]],
        y=np.r_[mi_means - mi_stds, (mi_means + mi_stds)[::-1]],
        fill="toself",
        fillcolor="rgba(0,0,0,0.10)",
        line=dict(color="rgba(0,0,0,0)"),
        hoverinfo="skip",
        name="±1 std",
    )
)

max_mi = np.log(3)
fig.add_hline(y=max_mi, line_dash="dash", line_color="gray", annotation_text="log(3) max")

fig.update_layout(
    title="Mutual information vs label noise",
    xaxis_title="noise probability p (replace label with random label)",
    yaxis_title="MI (nats)",
)
fig.show()

5) What MI is “adding up”: observed vs expected co-occurrences#

Each contingency cell \((i,j)\) compares:

  • observed count: \(n_{ij}\)

  • expected count under independence: \(\frac{n_{i*} n_{*j}}{n}\)

The per-cell contribution is:

\[ \frac{n_{ij}}{n}\log\left(\frac{n_{ij}}{\frac{n_{i*}n_{*j}}{n}}\right) \]

Let’s visualize those contributions.

def mi_contribution_matrix(contingency):
    contingency = np.asarray(contingency, dtype=float)
    n = contingency.sum()
    if n == 0:
        return np.zeros_like(contingency, dtype=float)

    row_sum = contingency.sum(axis=1, keepdims=True)
    col_sum = contingency.sum(axis=0, keepdims=True)
    denom = row_sum * col_sum

    ratio = np.ones_like(contingency, dtype=float)
    np.divide(contingency * n, denom, out=ratio, where=(contingency > 0))

    contrib = np.zeros_like(contingency, dtype=float)
    mask = contingency > 0
    contrib[mask] = (contingency[mask] / n) * np.log(ratio[mask])
    return contrib


contrib_noisy = mi_contribution_matrix(cont_noisy)

fig = px.imshow(
    contrib_noisy,
    text_auto=".3f",
    aspect="auto",
    color_continuous_scale="RdBu",
    color_continuous_midpoint=0.0,
    labels=dict(x="pred label", y="true label", color="contrib (nats)"),
    x=[str(v) for v in pred_noisy_vals],
    y=[str(v) for v in true_vals],
    title="Per-cell MI contributions (noisy example, p=0.3)",
)
fig.show()

print(f"Sum of contributions = {contrib_noisy.sum():.4f} nats (should equal MI)")
Sum of contributions = 0.4862 nats (should equal MI)

6) Using MI in a simple optimization loop: tune a logistic-regression threshold#

mutual_info_score needs discrete labels. For a probabilistic classifier (like logistic regression) we usually have scores/probabilities \(\hat{p}(x)\).

A common way to use MI is as a validation objective over a non-differentiable choice, e.g. the decision threshold \(\tau\):

\[ \hat{y}(\tau) = \mathbb{1}[\hat{p}(x) \ge \tau], \qquad \tau^* = \arg\max_{\tau\in[0,1]} I(Y;\hat{Y}(\tau)). \]

We’ll:

  1. fit logistic regression from scratch (gradient descent on log-loss),

  2. sweep thresholds on a validation set,

  3. pick the threshold that maximizes MI.

X, y = make_classification(
    n_samples=1600,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.8, 0.2],
    class_sep=1.2,
    random_state=42,
)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.35, random_state=42, stratify=y
)

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_s = (X_train - mean) / std
X_val_s = (X_val - mean) / std

# add intercept column
X_train_b = np.c_[np.ones(X_train_s.shape[0]), X_train_s]
X_val_b = np.c_[np.ones(X_val_s.shape[0]), X_val_s]

fig = px.scatter(
    x=X_val_s[:, 0],
    y=X_val_s[:, 1],
    color=y_val.astype(int),
    title="Validation data (standardized features)",
    labels={"x": "x1 (standardized)", "y": "x2 (standardized)", "color": "class"},
)
fig.show()
def sigmoid(z):
    z = np.clip(z, -50, 50)  # numerical stability
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression_gd(X, y, lr=0.2, n_iter=2500, l2=0.0):
    """Binary logistic regression with gradient descent on log-loss.

    X: (n, d) including intercept column if desired.
    y: (n,) in {0,1}.
    l2: L2 regularization strength (applied to weights except intercept).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    n, d = X.shape
    w = np.zeros(d, dtype=float)

    eps = 1e-12
    losses = np.empty(n_iter, dtype=float)

    for t in range(n_iter):
        p = sigmoid(X @ w)

        loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        loss += 0.5 * l2 * np.sum(w[1:] ** 2)
        losses[t] = loss

        grad = (X.T @ (p - y)) / n
        grad[1:] += l2 * w[1:]

        w -= lr * grad

    return w, losses


w, losses = fit_logistic_regression_gd(X_train_b, y_train, lr=0.2, n_iter=2500, l2=0.1)
print("w =", w)

fig = px.line(
    y=losses,
    title="Training curve (log-loss)",
    labels={"x": "iteration", "y": "log-loss"},
)
fig.show()
w = [-1.6346  1.0995  0.1196]
def accuracy(y_true, y_pred):
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    return float(np.mean(y_true == y_pred))


def confusion_counts(y_true, y_pred):
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn


p_val = sigmoid(X_val_b @ w)

taus = np.linspace(0.01, 0.99, 99)
mi_vals = np.empty_like(taus)
acc_vals = np.empty_like(taus)

for i, tau in enumerate(taus):
    y_pred_tau = (p_val >= tau).astype(int)
    mi_vals[i] = mutual_info_score_numpy(y_val, y_pred_tau)
    acc_vals[i] = accuracy(y_val, y_pred_tau)

best_i = int(np.argmax(mi_vals))
best_tau = float(taus[best_i])

y_pred_best = (p_val >= best_tau).astype(int)
y_pred_05 = (p_val >= 0.5).astype(int)

print(f"Best threshold (by MI): tau* = {best_tau:.2f}")
print(f"MI(tau*)  = {mutual_info_score_numpy(y_val, y_pred_best):.4f} nats")
print(f"MI(0.50)  = {mutual_info_score_numpy(y_val, y_pred_05):.4f} nats")
print(f"Acc(tau*) = {accuracy(y_val, y_pred_best):.3f}")
print(f"Acc(0.50) = {accuracy(y_val, y_pred_05):.3f}")

tp, fp, tn, fn = confusion_counts(y_val, y_pred_best)
print(f"Confusion@tau*: TP={tp} FP={fp} TN={tn} FN={fn}")
Best threshold (by MI): tau* = 0.36
MI(tau*)  = 0.3175 nats
MI(0.50)  = 0.2191 nats
Acc(tau*) = 0.954
Acc(0.50) = 0.914
Confusion@tau*: TP=92 FP=4 TN=442 FN=22
fig = go.Figure()

fig.add_trace(go.Scatter(x=taus, y=mi_vals, mode="lines", name="MI (nats)"))
fig.add_trace(go.Scatter(x=taus, y=acc_vals, mode="lines", name="Accuracy", yaxis="y2"))

fig.add_vline(x=best_tau, line_dash="dash", line_color="black", annotation_text="tau*")
fig.add_vline(x=0.5, line_dash="dot", line_color="gray", annotation_text="0.5")

fig.update_layout(
    title="Threshold sweep on validation set",
    xaxis_title="threshold tau",
    yaxis=dict(title="MI (nats)"),
    yaxis2=dict(title="Accuracy", overlaying="y", side="right", range=[0, 1]),
    legend=dict(x=0.01, y=0.99),
)
fig.show()
def logit(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return np.log(p / (1 - p))


def decision_boundary_x2(w, tau, x1_grid):
    # w0 + w1*x1 + w2*x2 = logit(tau)
    w0, w1, w2 = w
    if abs(w2) < 1e-12:
        raise ValueError("w2 is ~0; boundary would be vertical in (x1,x2)")
    return (logit(tau) - w0 - w1 * x1_grid) / w2


x1_min, x1_max = X_val_s[:, 0].min() - 0.5, X_val_s[:, 0].max() + 0.5
x1_grid = np.linspace(x1_min, x1_max, 200)

x2_tau05 = decision_boundary_x2(w, tau=0.5, x1_grid=x1_grid)
x2_best = decision_boundary_x2(w, tau=best_tau, x1_grid=x1_grid)

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=X_val_s[:, 0],
        y=X_val_s[:, 1],
        mode="markers",
        marker=dict(
            color=y_val.astype(int),
            colorscale="Viridis",
            showscale=True,
            colorbar=dict(title="class"),
            size=6,
            opacity=0.8,
        ),
        name="val points",
    )
)

fig.add_trace(
    go.Scatter(
        x=x1_grid,
        y=x2_tau05,
        mode="lines",
        line=dict(color="black", dash="dot", width=3),
        name="boundary (tau=0.5)",
    )
)

fig.add_trace(
    go.Scatter(
        x=x1_grid,
        y=x2_best,
        mode="lines",
        line=dict(color="crimson", dash="dash", width=3),
        name=f"boundary (tau*={best_tau:.2f})",
    )
)

fig.update_layout(
    title="Decision boundary shifts when you change the threshold",
    xaxis_title="x1 (standardized)",
    yaxis_title="x2 (standardized)",
    height=450,
)
fig.show()

7) Pitfalls (and why NMI/AMI exist)#

Two common gotchas:

  1. MI is not normalized: values depend on the entropies of the labelings (number of clusters and their balance). Comparing MI across datasets can be misleading.

  2. MI does not penalize over-segmentation: a clustering with many tiny clusters can still have a high MI with the ground truth. In the extreme, giving every sample its own cluster makes \(H(U\mid V)=0\), so MI attains its maximum \(H(U)\).
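The first pitfall is visible even with perfect agreement: the MI of a labeling with itself equals its entropy, so the ceiling of the score moves with the number (and balance) of clusters:

```python
import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

for k in [2, 4, 10]:
    labels = np.repeat(np.arange(k), 100)  # k balanced clusters
    mi = mutual_info_score(labels, labels)              # perfect agreement
    nmi = normalized_mutual_info_score(labels, labels)  # normalized: 1 in every case
    print(f"k={k:>2}: MI={mi:.4f} (= log {k} = {np.log(k):.4f}) | NMI={nmi:.4f}")
```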

Let’s see the second pitfall.

labels_true_small = labels_true
labels_unique = np.arange(labels_true_small.shape[0])  # each sample its own cluster

for name, pred in [
    ("perfect (permuted)", labels_perm),
    ("unique cluster per sample", labels_unique),
    ("random labels (3 clusters)", labels_random),
]:
    mi = mutual_info_score(labels_true_small, pred)
    nmi = normalized_mutual_info_score(labels_true_small, pred)
    ami = adjusted_mutual_info_score(labels_true_small, pred)

    print(f"{name:>24}: MI={mi:.4f} | NMI={nmi:.4f} | AMI={ami:.4f}")
      perfect (permuted): MI=1.0986 | NMI=1.0000 | AMI=1.0000
unique cluster per sample: MI=1.0986 | NMI=0.3048 | AMI=-0.0000
random labels (3 clusters): MI=0.0008 | NMI=0.0007 | AMI=-0.0034

8) Pros, cons, and where MI shines#

Pros

  • Permutation-invariant: relabeling clusters doesn’t change the score.

  • Captures general dependence: not limited to linear relationships (for discrete variables).

  • Uses full contingency table: considers all overlaps, not only “matching” pairs.

Cons / caveats

  • Unbounded / not directly interpretable across problems (depends on label entropies).

  • Can reward over-segmentation (many clusters) → consider normalized_mutual_info_score or adjusted_mutual_info_score.

  • Requires discrete variables: continuous features need discretization or different MI estimators.

  • Non-differentiable for hard labels: not a convenient loss for gradient-based training; better as a validation metric / search objective.

Common use cases

  • External clustering evaluation when ground truth labels exist.

  • Comparing two different clusterings/segmentations of the same data.

  • Feature selection (MI between a discrete target and a discretized feature).
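As a sketch of the feature-selection use case, here is one way to score a continuous feature against a discrete target by discretizing it first (`mi_with_target`, the bin count, and the synthetic data are all illustrative choices, not a standard API):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# One feature whose mean shifts with the class, one pure-noise feature.
x_informative = rng.normal(loc=y.astype(float), scale=1.0)
x_noise = rng.normal(size=1000)

def mi_with_target(x, y, n_bins=10):
    # Discretize into equal-frequency (quantile) bins, then score the bin ids.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return mutual_info_score(y, np.digitize(x, edges))

print(f"informative: {mi_with_target(x_informative, y):.4f} nats")
print(f"noise:       {mi_with_target(x_noise, y):.4f} nats")
```

The noise feature scores small but not exactly 0: the plug-in estimator is biased upward on finite samples, which is one motivation for adjusted_mutual_info_score.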

Exercises#

  1. Implement normalized mutual information (NMI) from scratch using entropies.

  2. Show how MI changes when clusters are highly imbalanced (e.g., 95/5 split).

  3. Use MI to select the best of several clustering hyperparameters (e.g., choose \(k\) in k-means when ground truth exists).

References#

  • scikit-learn: sklearn.metrics.mutual_info_score

  • Cover & Thomas, Elements of Information Theory (entropy, mutual information)